
    Detection of global state predicates

    The problem addressed here arises in the context of Meta: how can a set of processes monitor the state of a distributed application in a consistent manner? For example, consider the simple distributed application shown here. Each of the three processes in the application has a light, and each control process would like to take an action when some specified subset of the lights is on. The application processes are instrumented with stubs that determine when a process turns its light on or off. This information is disseminated to the control processes, each of which then determines when its condition of interest is met. Meta is built on top of the ISIS toolkit, so we first built the sensor dissemination mechanism using atomic broadcast. Atomic broadcast guarantees that all recipients receive messages in the same order and that this order is consistent with causality. Unfortunately, the control processes are somewhat limited in what they can deduce when they find that their condition of interest holds.
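
    A minimal sketch of the monitoring pattern this example describes, assuming a delivery layer that presents sensor events to every control process in one agreed total order (as atomic broadcast provides); the LightEvent and Monitor names are illustrative and are not part of Meta or ISIS:

        #include <array>
        #include <iostream>
        #include <utility>
        #include <vector>

        // One sensor event, as an instrumentation stub might report it:
        // application process `id` turned its light on or off.
        struct LightEvent {
            int  id;  // which application process (0, 1, or 2)
            bool on;  // new state of that process's light
        };

        // A control process; every control process is assumed to receive
        // the same totally ordered stream of LightEvents.
        class Monitor {
        public:
            explicit Monitor(std::vector<int> watched) : watched_(std::move(watched)) {}

            // Deliver the next event in the agreed order; returns true when the
            // condition of interest (all watched lights on) currently holds.
            bool deliver(const LightEvent& e) {
                lights_[e.id] = e.on;
                for (int id : watched_)
                    if (!lights_[id]) return false;
                return true;
            }

        private:
            std::array<bool, 3> lights_{};  // the three application processes
            std::vector<int>    watched_;   // subset whose lights must all be on
        };

        int main() {
            Monitor m({0, 2});  // act when the lights of processes 0 and 2 are both on
            std::vector<LightEvent> stream = {{0, true}, {1, true}, {2, true}, {0, false}};
            for (const LightEvent& e : stream)
                if (m.deliver(e))
                    std::cout << "condition holds after event from process " << e.id << "\n";
        }

    Without that shared total order, two control processes could observe the same on/off events interleaved differently and disagree about whether their subset of lights was ever simultaneously on.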

    Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory

    We propose an axiomatic generic framework for modelling weak memory. We show how to instantiate this framework for SC, TSO, C++ restricted to release-acquire atomics, and Power. For Power, we compare our model to a preceding operational model in which we found a flaw. To do so, we define an operational model that we show equivalent to our axiomatic model. We also propose a model for ARM. Our testing on this architecture revealed a behaviour later acknowledged as a bug by ARM, and more recently 31 additional anomalies. We offer a new simulation tool, called herd, which allows the user to specify the model of his choice in a concise way. Given a specification of a model, the tool becomes a simulator for that model. The tool relies on an axiomatic description; this choice allows us to outperform all previous simulation tools. Additionally, we confirm that verification time is vastly improved in the case of bounded model checking. Finally, we put our models in perspective, in the light of empirical data obtained by analysing the C and C++ code of a Debian Linux distribution. We present our new analysis tool, called mole, which explores a piece of code to find the weak memory idioms that it uses.
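
    As a concrete illustration of the kind of weak-memory idiom such tools reason about (not an example taken from the paper), here is the classic store-buffering litmus test written with C++ relaxed atomics:

        #include <atomic>
        #include <cstdio>
        #include <thread>

        // Store-buffering (SB) litmus test: each thread writes one variable
        // and then reads the other, with no ordering constraints.
        std::atomic<int> x{0}, y{0};
        int r0 = 0, r1 = 0;

        void p0() {
            x.store(1, std::memory_order_relaxed);
            r0 = y.load(std::memory_order_relaxed);
        }

        void p1() {
            y.store(1, std::memory_order_relaxed);
            r1 = x.load(std::memory_order_relaxed);
        }

        int main() {
            std::thread t0(p0), t1(p1);
            t0.join();
            t1.join();
            // The outcome r0 == 0 && r1 == 0 is forbidden under sequential
            // consistency but allowed under TSO, Power, ARM, and the C++
            // relaxed model; a memory model must decide which executions of
            // tests like this are permitted.
            std::printf("r0=%d r1=%d\n", r0, r1);
        }

    A simulator such as herd enumerates the candidate executions of a litmus test like this against a chosen model, while mole searches existing C and C++ code for occurrences of such idioms.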


    Distributed Consensus Revisited

    Distributed Consensus is a classical problem in distributed computing. It requires the correct processors in a distributed system to agree on a common value despite the failure of other processors. This problem is closely related to other problems, such as Byzantine Generals, Approximate Agreement, and k-Set Agreement. This paper examines a variant of Distributed Consensus that considers agreement on a value that is more than a single bit and requires that the agreed-upon value be one of the correct processors' input values. It shows that, for this problem to be solved in a system with arbitrary failures, more processors must remain correct than are required for solutions to Distributed Consensus or for cases in which agreement is on only a single bit. Specifically, the number of processors that must be correct is a function of the size of the domain of values used. Two existing consensus algorithms are modified to solve this stronger variant.
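
    The distinguishing requirement of this variant is its validity condition; a minimal sketch of that condition as a predicate (the function and parameter names are illustrative, not taken from the paper):

        #include <cstddef>
        #include <vector>

        // Strong validity: the decided value must be the input of some
        // *correct* processor, not merely some value from the domain.
        bool strong_validity(int decided,
                             const std::vector<int>& inputs,    // input of each processor
                             const std::vector<bool>& correct)  // which processors stayed correct
        {
            for (std::size_t i = 0; i < inputs.size(); ++i)
                if (correct[i] && inputs[i] == decided)
                    return true;
            return false;
        }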

    Space-Efficient Atomic Snapshots in Synchronous Systems

    We consider the problem of implementing an atomic snapshot memory in synchronous distributed systems. An atomic snapshot memory is an array of memory locations, one per processor. Each processor may update its own location or scan all locations atomically. We are interested in implementations that are space-efficient in the sense that they are honest: the implementation may use no more shared memory than that of the array being implemented, and that memory must truly reflect the contents of the array. If n is the number of processors involved, then the worst-case scanning time must be at least n. We show that the sum of the worst-case update and scanning times must be greater than floor(3n/2). We exhibit two honest implementations. One has scans and updates with worst-case times of n+1 for both operations; for scans, this is near the lower bound. The other requires longer scans (with worst-case time ceiling(3n/2)+1) but shorter updates (with worst-case time ceiling(n/2)+1). Thus, both implementations have worst-case times summing to 2n + O(1), which is within n/2 of the lower bound. Closing the gap between these algorithms and the combined lower bound remains an open problem.
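
    A minimal sketch of the object being implemented, i.e. the update/scan interface and the honesty constraint of one shared cell per processor; the paper's synchronous, honest implementations and their round counts are not reproduced here:

        #include <cstddef>
        #include <vector>

        // One location per processor: a processor may overwrite its own
        // location (update) or read all locations as if instantaneously (scan).
        // An "honest" implementation may use only these n shared locations,
        // and they must always hold the array's real contents.
        class SnapshotMemory {
        public:
            explicit SnapshotMemory(std::size_t n) : cells_(n, 0) {}

            // Processor i writes value v into its own cell.
            void update(std::size_t i, int v) { cells_[i] = v; }

            // Returns the contents of all cells; a real implementation must make
            // this appear atomic with respect to concurrent updates, which is the
            // source of the n-round scan lower bound discussed above.
            std::vector<int> scan() const { return cells_; }

        private:
            std::vector<int> cells_;  // exactly n shared locations ("honesty")
        };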

    Automatically increasing the fault-tolerance of distributed algorithms

    The design of fault-tolerant distributed systems is a costly and difficult task. Its cost and difficulty increase dramatically with the severity of failures that a system must tolerate. We seek to simplify this task by developing methods to automatically translate protocols tolerant of “benign” failures into ones tolerant of more “severe” failures. This paper describes two new translation mechanisms for synchronous systems; one translates programs tolerant of crash failures into programs tolerant of general omission failures, and the other translates from general omission failures to arbitrary failures. Together these can be used to translate any program tolerant of the most benign failures into a program tolerant of the most severe.

    Simplifying Fault-Tolerance: Providing the Abstraction of Crash Failures

    The difficulty of designing fault-tolerant distributed algorithms increases with the severity of failures that an algorithm must tolerate. This paper considers methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe failures. These translations simplify the design task by allowing algorithm designers to assume that processors fail only by stopping. Such translations can be quantified by two measures: fault-tolerance, which is a measure of how many processors must remain nonfaulty for the translation to be correct, and round-complexity, which is a measure of how the translation increases the running time of an algorithm. Understanding these translations and their limitations with respect to these measures can provide insight into the relative impact of different models of faulty behavior on the ability to provide fault-tolerant applications. This paper considers two classes of translations from crash failures to each of the following types of more severe failures: omission to send messages; omission to send and receive messages; and totally arbitrary behavior. It shows that previously developed translations to send-omission failures are optimal with respect to both fault-tolerance and round-complexity. It exhibits a hierarchy of translations to general (send/receive) omissions that improves upon the fault-tolerance of previously developed translations. It also gives a series of translations to arbitrary failures that improves upon the round-complexity of previously developed translations. All translations developed in this paper are shown to be optimal in that they cannot be improved with respect to one measure without negatively affecting the other; that is, both hierarchies of translations are matched by corresponding hierarchies of impossibility results.

    Using Knowledge to Optimally Achieve Coordination in Distributed Systems

    The problem of coordinating the actions of individual processors is fundamental in distributed computing. Researchers have long endeavored to find efficient solutions to a variety of problems involving coordination. Recently, processor knowledge has been used to characterize such solutions and to derive more efficient ones. Most of this work has concentrated on the relationship between common knowledge and simultaneous coordination. This paper takes an alternative approach, considering problems in which coordinated actions need not be performed simultaneously. This approach permits better understanding of the relationship between knowledge and the different requirements of coordination problems. This paper defines the ideas of optimal and optimum solutions to a coordination problem and precisely characterizes the problems for which optimum solutions exist. This characterization is based on combinations of eventual common knowledge and continual common knowledge. The paper then considers more general problems, for which optimal, but no optimum, solutions exist. It defines a new form of knowledge, called extended common knowledge, which combines eventual and continual knowledge, and shows how extended common knowledge can be used to both characterize and construct optimal protocols for coordination.

    The Complexity of Almost-Optimal Coordination

    The problem of fault-tolerant coordination is fundamental in distributed computing. In the past, researchers have considered the complexity of achieving optimal simultaneous coordination under various failure assumptions. This paper studies the complexity of achieving simultaneous coordination in synchronous systems in the presence of send/receive omission failures. It had been shown earlier that achieving optimal simultaneous coordination in these systems requires NP-hard local computation. In this paper, we study almost-optimal coordination, which requires processors to coordinate within a constant additive number of rounds, or a constant multiplicative factor, of the coordination time of an optimal protocol. We show that achieving almost-optimal coordination also requires NP-hard computation.